Heidi Beezub Mercyhurst University Graduate Student 9/8/2017

Sample Website for showing the use of:

Filtering

Ordering

Aggregation

Examples use Lahman baseball database. First load Lahman & dplyr packages Go to “tools”" on R studio. Select “install Packages” then type in “Lahman” & hit the “install” button (it may take a while). Once Lahman package is installed, then install the “dplyr” package.

Once the packages are installed, you need to “load” them into R-studio. type “library(Lahman)” and once that is finished type in library(dplyr). Note the commands & file names are case sensitive. If you have issues, check your spelling & case.

in order to insert the code into the website, we use the “insert” on the session tool bar.

library(Lahman)
library(dplyr)

Filtering Columns & Rows:

Using Dplyr, we can filter colums & rows by using “select” for specific columns & “filter” for specific rows.

Exmple 1:

Now suppose we would like to see the homerun totals for the 1927 New York Yankees. We could run the following code:

Batting %>%
  select (playerID,yearID,teamID,HR)%>%
  filter (teamID=='NYA' & yearID==1927)
##     playerID yearID teamID HR
## 1  beallwa01   1927    NYA  0
## 2  bengobe01   1927    NYA  0
## 3  collipa01   1927    NYA  7
## 4  combsea01   1927    NYA  6
## 5  duganjo01   1927    NYA  2
## 6  durstce01   1927    NYA  0
## 7  gazelmi01   1927    NYA  0
## 8  gehrilo01   1927    NYA 47
## 9  giardjo01   1927    NYA  0
## 10 grabojo01   1927    NYA  0
## 11  hoytwa01   1927    NYA  0
## 12 koenima01   1927    NYA  3
## 13 lazzeto01   1927    NYA 18
## 14 meusebo01   1927    NYA  8
## 15 moorewi01   1927    NYA  1
## 16 morehra01   1927    NYA  1
## 17 paschbe01   1927    NYA  2
## 18 pennohe01   1927    NYA  0
## 19 pipgrge01   1927    NYA  1
## 20 ruethdu01   1927    NYA  1
## 21  ruthba01   1927    NYA 60
## 22 shawkbo01   1927    NYA  0
## 23 shockur01   1927    NYA  0
## 24 thomamy01   1927    NYA  0
## 25  weraju01   1927    NYA  1

If we want to add more nuance (like my showoff friend Ron) we can qualify for over 50 home runs with this code:

Batting %>%
  select (playerID,yearID,teamID,HR)%>%
  filter (teamID=='NYA' & yearID==1927 & HR > 50)
##   playerID yearID teamID HR
## 1 ruthba01   1927    NYA 60

Ordering:

Using Dplyr, we can order results in ascending (default) or descending order.

Exmple 2:

To find all the New York Yankees with more than 40 Home Runs we would run the following code. Notice these are in order by year smallest to largest. Ascending order is the default.

Batting %>%
    select(playerID,yearID,teamID,HR)%>%
    filter(HR>=40 & teamID=='NYA')
##     playerID yearID teamID HR
## 1   ruthba01   1920    NYA 54
## 2   ruthba01   1921    NYA 59
## 3   ruthba01   1923    NYA 41
## 4   ruthba01   1924    NYA 46
## 5   ruthba01   1926    NYA 47
## 6  gehrilo01   1927    NYA 47
## 7   ruthba01   1927    NYA 60
## 8   ruthba01   1928    NYA 54
## 9   ruthba01   1929    NYA 46
## 10 gehrilo01   1930    NYA 41
## 11  ruthba01   1930    NYA 49
## 12 gehrilo01   1931    NYA 46
## 13  ruthba01   1931    NYA 46
## 14  ruthba01   1932    NYA 41
## 15 gehrilo01   1934    NYA 49
## 16 gehrilo01   1936    NYA 49
## 17 dimagjo01   1937    NYA 46
## 18 mantlmi01   1956    NYA 52
## 19 mantlmi01   1958    NYA 42
## 20 mantlmi01   1960    NYA 40
## 21 mantlmi01   1961    NYA 54
## 22 marisro01   1961    NYA 61
## 23 jacksre01   1980    NYA 41
## 24 martiti02   1997    NYA 44
## 25 giambja01   2002    NYA 41
## 26 giambja01   2003    NYA 41
## 27 rodrial01   2005    NYA 48
## 28 rodrial01   2007    NYA 54
## 29 grandcu01   2011    NYA 41
## 30 grandcu01   2012    NYA 43

Exmple 3:

ALL players with more than 40 Home Runs but less than 60 Strike outs would need the following code:

Batting %>%
  select(playerID,yearID,teamID,HR,SO)%>%
  filter(HR>=40 & SO<=60)
##     playerID yearID teamID HR SO
## 1  hornsro01   1922    SLN 42 50
## 2  willicy01   1923    PHI 41 57
## 3    ottme01   1929    NY1 42 38
## 4   ruthba01   1929    NYA 46 60
## 5  kleinch01   1930    PHI 40 50
## 6  gehrilo01   1931    NYA 46 56
## 7   ruthba01   1931    NYA 46 51
## 8  gehrilo01   1934    NYA 49 31
## 9  gehrilo01   1936    NYA 49 46
## 10 troskha01   1936    CLE 42 58
## 11 dimagjo01   1937    NYA 46 37
## 12  mizejo01   1940    SLN 43 49
## 13  mizejo01   1947    NY1 51 42
## 14  mizejo01   1948    NY1 40 37
## 15 willite01   1949    BOS 43 48
## 16 kinerra01   1951    PIT 42 57
## 17 camparo01   1953    BRO 41 58
## 18 kluszte01   1953    CIN 40 34
## 19 rosenal01   1953    CLE 43 48
## 20 kluszte01   1954    CIN 49 35
## 21  mayswi01   1954    NY1 41 57
## 22 kluszte01   1955    CIN 47 40
## 23  mayswi01   1955    NY1 51 60
## 24 aaronha01   1957    ML1 44 58
## 25 sievero01   1957    WS1 42 55
## 26 aaronha01   1969    ATL 44 47
## 27 aaronha01   1971    ATL 47 58
## 28 aaronha01   1973    ATL 40 51
## 29 thomafr04   1993    CHA 41 54
## 30 bondsba01   2002    SFN 46 47
## 31 bondsba01   2003    SFN 45 58
## 32 bondsba01   2004    SFN 45 41
## 33 pujolal01   2004    SLN 46 52
## 34 pujolal01   2006    SLN 49 50

Exmple 4:

Now let’s look at the Philadelphia Phillies from the 1970’s. We want the best players with more than 30 Home runs. This code adds filters for the year:

Batting %>%
  select(playerID,yearID,teamID,HR,SO)%>%
  filter(HR>=30 & teamID=='PHI' & yearID>=1970 & yearID<=1979)
##     playerID yearID teamID HR  SO
## 1  johnsde01   1971    PHI 34 146
## 2  montawi01   1971    PHI 30 105
## 3  schmimi01   1974    PHI 36 138
## 4  luzingr01   1975    PHI 34 151
## 5  schmimi01   1975    PHI 38 180
## 6  schmimi01   1976    PHI 38 149
## 7  luzingr01   1977    PHI 39 140
## 8  schmimi01   1977    PHI 38 122
## 9  luzingr01   1978    PHI 35 135
## 10 schmimi01   1979    PHI 45 115

Exmple 5:

Back to all teams & players again. We want to see anyone with more than 50 home runs, but we want the list sorted with the most home runs at the top. Since ascending is the default, we need to add in code to arrange it in descending order:

Batting %>%
  select(playerID,yearID,teamID,HR,SO)%>%
  filter(HR>50)%>%
  arrange(desc(HR))
##     playerID yearID teamID HR  SO
## 1  bondsba01   2001    SFN 73  93
## 2  mcgwima01   1998    SLN 70 155
## 3   sosasa01   1998    CHN 66 171
## 4  mcgwima01   1999    SLN 65 141
## 5   sosasa01   2001    CHN 64 153
## 6   sosasa01   1999    CHN 63 171
## 7  marisro01   1961    NYA 61  67
## 8   ruthba01   1927    NYA 60  89
## 9   ruthba01   1921    NYA 59  81
## 10  foxxji01   1932    PHA 58  96
## 11 greenha01   1938    DET 58  92
## 12 howarry01   2006    PHI 58 181
## 13 gonzalu01   2001    ARI 57  83
## 14 rodrial01   2002    TEX 57 122
## 15 wilsoha01   1930    CHN 56  84
## 16 griffke02   1997    SEA 56 121
## 17 griffke02   1998    SEA 56 121
## 18  ruthba01   1920    NYA 54  80
## 19  ruthba01   1928    NYA 54  87
## 20 kinerra01   1949    PIT 54  61
## 21 mantlmi01   1961    NYA 54 112
## 22 ortizda01   2006    BOS 54 117
## 23 rodrial01   2007    NYA 54 120
## 24 bautijo02   2010    TOR 54 116
## 25 davisch02   2013    BAL 53 199
## 26 mantlmi01   1956    NYA 52  99
## 27  mayswi01   1965    SFN 52  71
## 28 fostege01   1977    CIN 52 107
## 29 mcgwima01   1996    OAK 52 112
## 30 rodrial01   2001    TEX 52 131
## 31 thomeji01   2002    CLE 52 139
## 32 kinerra01   1947    PIT 51  81
## 33  mizejo01   1947    NY1 51  42
## 34  mayswi01   1955    NY1 51  60
## 35 fieldce01   1990    DET 51 182
## 36 jonesan01   2005    ATL 51 112

Exmple 6:

Now lets look a the best batters, we want players that struck out less than 10 times, but that have been up to bat 400 or more times. We’ll arrange these by the number of strikeouts with the least number at the top (the default):

Batting %>%
  select(playerID,yearID,teamID,AB,SO)%>%
  filter(SO<10 & AB>=400)%>%
  arrange(SO)
##     playerID yearID teamID  AB SO
## 1  doyleja01   1894    NY1 422  3
## 2  seweljo01   1932    NYA 503  3
## 3  seweljo01   1925    CLE 608  4
## 4  seweljo01   1929    CLE 578  4
## 5  seweljo01   1933    NYA 524  4
## 6   wardjo01   1893    NY1 588  5
## 7  holloch01   1922    CHN 592  5
## 8  mcinnst01   1922    CLE 537  5
## 9  wanerll01   1936    PIT 414  5
## 10 wrighge01   1875    BS1 408  6
## 11 broutda01   1889    BSN 485  6
## 12 keelewi01   1894    BLN 590  6
## 13  wardjo01   1894    NY1 540  6
## 14 quinnjo02   1895    SLN 543  6
## 15 mcinnst01   1924    BSN 581  6
## 16 seweljo01   1926    CLE 578  6
## 17  wardjo01   1889    NY1 479  7
## 18 crossla01   1893    PHI 415  7
## 19 quinnjo02   1893    SLN 547  7
## 20 crossla01   1894    PHI 529  7
## 21 vaughfa01   1896    CIN 433  7
## 22 cochrmi01   1927    PHA 432  7
## 23 seweljo01   1927    CLE 569  7
## 24 traynpi01   1929    PIT 540  7
## 25 muelldo01   1956    NY1 453  7
## 26 connoro01   1885    NY1 455  8
## 27 glassja01   1887    IN3 483  8
## 28 glassja01   1890    NY1 512  8
## 29 donovpa01   1893    PIT 499  8
## 30 dungasa01   1893    CHN 465  8
## 31 pinknge01   1893    LS3 446  8
## 32 brodist01   1894    BLN 573  8
## 33 quinnjo02   1894    SLN 405  8
## 34 bierblo01   1895    PIT 466  8
## 35 crossla01   1895    PHI 535  8
## 36   hoydu01   1895    CIN 429  8
## 37 roushed01   1921    CIN 418  8
## 38 collied01   1923    CHA 505  8
## 39 collied01   1925    CHA 425  8
## 40 speaktr01   1927    WS1 523  8
## 41 cochrmi01   1929    PHA 514  8
## 42 seweljo01   1931    NYA 484  8
## 43 wanerll01   1933    PIT 500  8
## 44 verbaem01   1947    PHI 540  8
## 45 ansonca01   1883    CHN 413  9
## 46 broutda01   1887    DTN 500  9
## 47   hoydu01   1893    WAS 564  9
## 48 bierblo01   1894    PIT 525  9
## 49 broutda01   1894    BLN 525  9
## 50 milledo01   1894    SLN 481  9
## 51 keelewi01   1896    BLN 544  9
## 52 mckeaed01   1896    CL4 571  9
## 53 speaktr01   1918    CLE 471  9
## 54  dealch01   1921    CHN 422  9
## 55 mcinnst01   1921    BOS 584  9
## 56 severha01   1921    SLA 472  9
## 57  highan01   1926    BSN 476  9
## 58 summaho01   1926    CLE 581  9
## 59 seweljo01   1928    CLE 588  9
## 60  ricesa01   1929    WS1 616  9
## 61 leachfr01   1931    NY1 515  9
## 62 busched01   1945    PHA 416  9
## 63 holmeto01   1945    BSN 636  9
## 64 boudrlo01   1948    CLE 560  9
## 65 mitchda01   1952    CLE 511  9

Aggregation:

Dplyr can also do sums, averages, max or minimum to combine total from rows.

Exmple 7:

The following code totals all of Babe Ruth’s home runs.

Batting%>%
  filter(playerID=='ruthba01')%>%
  group_by(playerID)%>%
  summarize(career_HR=sum(HR))
## # A tibble: 1 x 2
##   playerID career_HR
##      <chr>     <int>
## 1 ruthba01       714

Exmple 8:

Find the Career HR totals for all players, but limit the list to those who have a career total of 600 or more homeruns. Put the highest totals at the top.

Batting%>%
  group_by(playerID)%>%
  summarize(career_HR=sum(HR))%>%
  filter(career_HR>=600)%>%
  arrange(desc(career_HR))
## # A tibble: 8 x 2
##    playerID career_HR
##       <chr>     <int>
## 1 bondsba01       762
## 2 aaronha01       755
## 3  ruthba01       714
## 4 rodrial01       696
## 5  mayswi01       660
## 6 griffke02       630
## 7 thomeji01       612
## 8  sosasa01       609

Exmple 9:

What players have the highest average (mean), per season Home run totals filer by at lease an average of 30.

Batting%>%
  group_by(playerID)%>%
  summarize(ave_HR=mean(HR))%>%
  filter(ave_HR>30)%>%
  arrange(desc(ave_HR))
## # A tibble: 13 x 2
##     playerID   ave_HR
##        <chr>    <dbl>
##  1 pujolal01 36.93750
##  2 bondsba01 34.63636
##  3 mcgwima01 34.29412
##  4 kinerra01 33.54545
##  5 aaronha01 32.82609
##  6 bryankr01 32.50000
##  7  ruthba01 32.45455
##  8  sosasa01 32.05263
##  9 cabremi01 31.85714
## 10 belleal01 31.75000
## 11 rodrial01 31.63636
## 12 schmimi01 30.44444
## 13 abreujo02 30.33333

Exmple 10:

Now we can add some more nuance: What players, since 1970 have hit more than 50 HR at least once (don’t list a player more than once). Since we only want a name listed once we’re looking for a maximum:

Batting%>%
  filter(yearID>=1970)%>%
  group_by(playerID)%>%
  summarize(max_HR=max(HR))%>%
  filter(max_HR>50)
## # A tibble: 14 x 2
##     playerID max_HR
##        <chr>  <dbl>
##  1 bautijo02     54
##  2 bondsba01     73
##  3 davisch02     53
##  4 fieldce01     51
##  5 fostege01     52
##  6 gonzalu01     57
##  7 griffke02     56
##  8 howarry01     58
##  9 jonesan01     51
## 10 mcgwima01     70
## 11 ortizda01     54
## 12 rodrial01     57
## 13  sosasa01     66
## 14 thomeji01     52

If only want names without the number of home runs we would add select(playerID):

Batting%>%
  filter(yearID>=1970)%>%
  group_by(playerID)%>%
  summarize(max_HR=max(HR))%>%
  filter(max_HR>50)%>%
  select(playerID)
## # A tibble: 14 x 1
##     playerID
##        <chr>
##  1 bautijo02
##  2 bondsba01
##  3 davisch02
##  4 fieldce01
##  5 fostege01
##  6 gonzalu01
##  7 griffke02
##  8 howarry01
##  9 jonesan01
## 10 mcgwima01
## 11 ortizda01
## 12 rodrial01
## 13  sosasa01
## 14 thomeji01

That concludes our examples & usinf dplyr!